
What to do when text analysis runs into mojibake like (à¸‡'âŒ£')à¸‡?

大邓 大邓和他的Python 2022-07-09

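Garbled text comes up constantly in text analysis. Faced with an encoding problem, we usually feel there is nothing to be done and simply skip whatever cannot be decoded.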

  text = open(file, errors='ignore').read()

But this throws away some information. So how do we actually deal with the gremlins that keep wreaking havoc on our text analysis?
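To make the information loss concrete, here is a minimal sketch using only the standard library. The sample string and the choice of GBK as the "real" encoding are assumptions for illustration, not taken from the article.

```python
# Bytes as they might sit in a file that was saved as GBK (assumed for illustration).
raw = "编码 encoding 问题".encode("gbk")

# Decoding as UTF-8 with errors="ignore" silently drops every byte
# sequence that is not valid UTF-8, so the Chinese characters vanish.
lossy = raw.decode("utf-8", errors="ignore")

# Decoding with the encoding the bytes were actually written in keeps everything.
correct = raw.decode("gbk")

print(repr(lossy))
print(repr(correct))
```

The ASCII part survives either way, which is exactly why this kind of loss is easy to miss.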

Just repeat after me: Python to the rescue! ftfy (fixes text for you) can clean up garbled data for us.

Installation

  !pip3 install ftfy==5.6

Mojibake examples: (à¸‡'âŒ£')à¸‡

I found these odd-looking strings in the official documentation; some of you have probably run into data like this before.

  1. (à¸‡'âŒ£')à¸‡

  2. Ã¼nicode

  3. Broken text&hellip; it&#x2019;s ﬂubberiﬁc!

  4. HTML entities &lt;3

  5. &macr;\\_(ã\x83\x84)_/&macr;

  6. \ufeffParty like\nit&rsquo;s 1999!

  7. ＬＯＵＤ　ＮＯＩＳＥＳ

  8. This â€” should be an em dash

  9. This text was never UTF-8 at all\x85

  10. \033[36;44mI'm blue, da ba dee da ba doo...\033[0m

  11. \u201chere\u2019s a test\u201d

  12. This string is made of two things:\u2029 1. Unicode\u2028 2. Spite

ftfy.fix_text: a cure for all kinds of broken strings

The fix_text function in ftfy can tame the vast majority of mojibake such as (à¸‡'âŒ£')à¸‡.

  from ftfy import fix_text

  fix_text("(à¸‡'âŒ£')à¸‡")

  "(ง'⌣')ง"


  fix_text('Ã¼nicode')

  'ünicode'
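Mojibake of this kind typically appears when UTF-8 bytes are decoded with a single-byte encoding. The sketch below reproduces the damage and reverses it by hand with the standard library; it assumes the wrong encoding was Latin-1, which ftfy has to work out on its own.

```python
original = "ünicode"

# Writing side: the text is stored as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Reading side: the bytes are wrongly decoded as Latin-1, producing mojibake.
garbled = utf8_bytes.decode("latin-1")

# Reversing the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

Doing this by hand only works when you already know which two encodings were mixed up; that guesswork is the part ftfy automates.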


  fix_text('Broken text&hellip; it&#x2019;s ﬂubberiﬁc!')

  "Broken text… it's flubberific!"


  fix_text('HTML entities &lt;3')

  'HTML entities <3'
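The entity decoding in these two examples can be approximated with the standard library's html.unescape. This is only a sketch of the idea; ftfy applies its own rules for when to decode entities and also straightens the curly apostrophe, which html.unescape does not.

```python
import html

# Named and numeric character references are replaced by the characters they stand for.
unescaped = html.unescape("HTML entities &lt;3")
apostrophe = html.unescape("it&rsquo;s 1999!")

print(unescaped)
print(apostrophe)
```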


  fix_text("&macr;\\_(ã\x83\x84)_/&macr;")

  '¯\\_(ツ)_/¯'


  fix_text('\ufeffParty like\nit&rsquo;s 1999!')

  "Party like\nit's 1999!"


  fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ')

  'LOUD NOISES'
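The LOUD NOISES case is about fullwidth letters, the wide variants of ASCII used in East Asian typography. As a rough standard-library equivalent, NFKC normalization folds them back to plain ASCII. Note the assumption here: NFKC is a broader transformation than the width fix alone, so this is an approximation and not a description of how ftfy implements it.

```python
import unicodedata

# Build the fullwidth version of the text: fullwidth ASCII sits 0xFEE0 above
# the regular code points, and the wide space is U+3000.
fullwidth = "".join(
    "\u3000" if ch == " " else chr(ord(ch) + 0xFEE0) for ch in "LOUD NOISES"
)

# NFKC replaces compatibility characters with their ordinary equivalents.
normalized = unicodedata.normalize("NFKC", fullwidth)

print(fullwidth)
print(normalized)
```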


  fix_text('Ãºnico')

  'único'


  fix_text('This â€” should be an em dash')

  'This — should be an em dash'


  fix_text('This text is sad .â\x81”.')

  'This text is sad .⁔.'


  fix_text('The more you know ðŸŒ\xa0')

  'The more you know 🌠'


  fix_text('This text was never UTF-8 at all\x85')

  'This text was never UTF-8 at all…'
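The last case is different from the others: \x85 is not misread UTF-8 but a control character that appears when Windows-1252 bytes are decoded as Latin-1. A minimal sketch of that interpretation with the standard library, assuming every character in the string fits in one byte:

```python
text = "This text was never UTF-8 at all\x85"

# Go back to the underlying bytes, then read them as Windows-1252,
# where byte 0x85 is the horizontal ellipsis.
repaired = text.encode("latin-1").decode("cp1252")

print(repaired)
```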


  fix_text("\033[36;44mI'm blue, da ba dee da ba doo...\033[0m")

  "I'm blue, da ba dee da ba doo..."
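The sequences removed here are ANSI terminal color codes. For comparison, a small regular expression can strip the common color and style codes. The pattern below is my own simplified assumption and does not cover every terminal escape.

```python
import re

# ESC, "[", optional numeric parameters separated by ";", then a final letter.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

colored = "\033[36;44mI'm blue, da ba dee da ba doo...\033[0m"
plain = ANSI_ESCAPE.sub("", colored)

print(plain)
```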


  fix_text('\u201chere\u2019s a test\u201d')

  '"here\'s a test"'


  text = "This string is made of two things:\u2029 1. Unicode\u2028 2. Spite"

  fix_text(text)

  'This string is made of two things:\n 1. Unicode\n 2. Spite'
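U+2028 (line separator) and U+2029 (paragraph separator) are invisible in most editors but still count as line breaks in Python. A short sketch of the same cleanup by hand:

```python
text = "This string is made of two things:\u2029 1. Unicode\u2028 2. Spite"

# str.splitlines() already treats both characters as line boundaries.
lines = text.splitlines()

# Replacing them with "\n" gives one consistent newline convention.
fixed = text.replace("\u2028", "\n").replace("\u2029", "\n")

print(lines)
print(repr(fixed))
```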


ftfy.fix_file: a cure for all kinds of broken files

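The examples above all fix strings, but ftfy can also process garbled files directly. I won't demonstrate that here; the point is that the next time you run into mojibake, you know there is a library called ftfy (fixes text for you) whose fix_text and fix_file can help.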
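For readers who want a starting point anyway, here is a minimal sketch of fix_file. It relies on fix_file accepting an open file object and yielding repaired lines one at a time; the helper name read_fixed and the default encoding are my own assumptions, not part of ftfy.

```python
def read_fixed(path, encoding="utf-8"):
    """Return the contents of a text file with every line repaired by ftfy.

    read_fixed is a hypothetical helper written for this example.
    """
    # Imported here so the dependency is explicit: requires `pip install ftfy`.
    from ftfy import fix_file

    with open(path, encoding=encoding) as handle:
        # fix_file yields fixed lines lazily, so large files are streamed.
        return "".join(fix_file(handle))
```

Calling read_fixed("reviews.txt") on a file of your own (the file name is a placeholder) returns the repaired text as one string. For very large files, iterate over fix_file(handle) directly instead of joining the lines.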

